Abstract

Interest in the patient's views of his or her illness and treatment has increased dramatically. However, our ability to 
appropriately measure such issues lags far behind the level of interest and need. Too often such measurement is 
considered to be a simple and trivial activity that merely requires the application of common sense. However, 
good quality measurement of patient-reported outcomes is a complex activity requiring considerable expertise and 
experience. This review considers the most important issues related to such measurement in the context of 
chronic disease and details how instruments should be developed, validated and adapted for use in additional 
languages. While there is often consensus on how best to undertake these activities, there is generally little 
evidence to support such accord. The present article questions these orthodox views and suggests alternative 
approaches that have been shown to be effective. 



Opinion 

Questionnaires are ubiquitous throughout life these 
days. Medicine is no different, with the patient rightly 
seen as a client whose views are crucial to gaining a 
clear understanding of anything from the quality of 
service provision to treatment effectiveness. Patients 
are increasingly regarded as one of the key stakeholder 
groups in medicine that, alongside regulators, payers 
and clinicians, can influence access to and reimburse- 
ment for pharmaceutical products. Much of the infor- 
mation on patient views is collected via questionnaires. 
Many, if not most, of these are hastily prepared by 
clinical or other professionals wishing to answer speci- 
fic questions that they consider to be important. 
Unfortunately, the development and application of 
such questionnaires is often regarded as a matter of 
'common sense' requiring little scientific consideration. 
However, in this area of research, common sense is 
commonly nonsense! In this article, I argue that many
of the questionnaires patients are asked to complete in
clinical practice and trials are of poor quality and collect
information that is of scant relevance to the patient.
In this respect, they are ultimately of limited value.

Questionnaires used to elicit information from
patients are now commonly referred to as patient-reported
outcome measures (PROMs). A PROM is far
more than a mechanism for gathering opinion. It is
designed to measure a specific concept (that is, a construct)
in a standardised way. Thus, it provides a
means of quantifying qualitative information. In reality,
there is a great deal of science involved in producing
good-quality PROMs. Indeed, the PROM development
process requires careful consideration of several key
issues as set out in Figure 1.

When selecting a PROM, it is crucial that evidence is 
available to show that each of these key issues has been 
considered and addressed during instrument develop- 
ment and testing. Where measures are required for use 
in different languages or cultures, there are additional 
considerations: Have appropriate methods been 
employed to translate the questionnaire? Have new lan- 
guage versions been tested to ensure that they are both 
suitable for local patients and have adequate psycho- 
metric and scaling properties? 



Correspondence: smckenna@galen-research.com

Director of Research, Galen Research Ltd, Enterprise House, Manchester
Science Park, Lloyd Street North, Manchester M15 6SE, UK

© 2011 McKenna; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.



McKenna BMC Medicine 2011, 9:86
http://www.biomedcentral.com/1741-7015/9/86






What will questionnaire measure?
•What construct should be measured?
•How is construct defined?

What should questionnaire include?
•How should content be produced (item generation)?
•How is best item set selected (item reduction)?

Suitability for potential respondents?
•Does it appear to measure construct appropriately (face validity)?
•Is content a reasonable representation of patient experience (content validity)?

Does questionnaire measure desired construct?
•Do items measure intended construct (construct validity)?
•Can item scores be added together validly (scalability)?
•Is level of measurement error acceptable (reliability)?
•Is scale able to measure real change in construct (responsiveness)?
•Are data collected free from biases unrelated to construct (differential item functioning; DIF)?

Will questionnaire be required for use in new language or culture?
•Have appropriate translation methods been employed?
•Have new language versions been tested to ensure suitability for local patients?
•Do adaptations have adequate psychometric and scaling properties?


Figure 1 Key considerations for patient-reported outcome questionnaire development. The major factors that should be considered when 
selecting a patient-reported outcome measurement (PROM) for use in clinical studies are shown. These emphasise the importance of ensuring 
that the PROM addresses the required outcome, that it has been carefully developed and that all versions developed (including language 
adaptations) are of good quality. 




What do PROMs measure? 

Patient-reported outcome (PRO) is an umbrella term 
that covers a range of different types of outcome (see 
Table 1). Symptoms and functioning are clearly defined 
as impairments and disability in the International Classi- 
fication of Impairment, Disability and Handicap [1]. Dis- 
ability is now referred to as activity [2]. PROMs should 
not be confused with clinical rating scales, where a clini- 
cian completes a form to rate disease severity or treat- 
ment effects. The common link between PROMs is that 
they collect information directly from the patient with- 
out interpretation by clinicians or others [3-5]. However, 
this does not imply that all PROMs measure issues that 
are of concern or importance to the patient. 



Measures of symptoms, activity limitations, health sta- 
tus, health-related quality of life (HRQL) and quality of life 
(QoL) completed by patients are all examples of PROMs 
[3,6]. More recently, PROMs have also been used in clini- 
cal trials to address issues of patient satisfaction, compli- 
ance with treatment and treatment preferences. Each of 
these outcomes represents a distinct measurement con- 
struct and these should not be confused. Indeed, the term 
'PRO' was coined in about 2000 specifically to avoid the 
misuse of, and the confusion surrounding, the term 'qual- 
ity of life'. It had been (and occasionally still is) common 
practice for instrument developers to refer to any PROM 
as a measure of QoL, even where it was clearly designed 
to address a different outcome construct [7]. 
Table 1 Types of patient-reported outcome measures (a)

Type of PRO            Constructs assessed                        Examples of coverage/domains
Symptoms               Impairment                                 Pain; fatigue; anxiety; depression; incontinence
Functioning            Disability/activity                        Bathing; dressing; walking; ability to work
Health status (HRQL)   Combination of impairment, disability      Activities of daily living (such as personal care);
                       and, occasionally, some QoL                symptoms and functions as above
Quality of life        QoL                                        Needs-based QoL
Utility (b)            Combination of impairment, disability      Symptoms and functions as above; activities of daily
                       or QoL                                     living (such as personal care); needs-based QoL

(a) PRO, patient-reported outcome; HRQL, health-related quality of life; QoL, quality of life.
(b) Responses to the questionnaire are used to generate a perceived utility score.



To summarise, PROMs that assess symptoms (that is, 
impairment) or functional limitations (such as disability
or activity limitations) address issues that are of primary 
interest to the clinician, as these are most indicative of 
disease severity. HRQL measures are made up of scales 
that assess symptoms and activity limitations. In con- 
trast, QoL scales determine outcomes that are of pri- 
mary concern to the patient. Severe impairment or 
functional limitations may well also be of concern to the 
patient, but only where these affect QoL. QoL scales 
should provide a holistic assessment of the impact of 
disease and its treatment on the patient. 

Unfortunately, when describing PROMs, few authors
state the model used to generate their content. Instead, it
is common practice to describe a range of different con- 
structs that should be measured. However, there is lim- 
ited agreement about the specific constructs that should 
be assessed [8]. Of the measurement models described 
in the literature, the most widely applied QoL model is 
concerned with the extent to which disease and its 
treatment prevent an individual from meeting his or her 
needs [9-13]. This approach argues that individuals are 
driven or motivated by their needs and that the fulfil- 
ment of these provides satisfaction and a good QoL [9]. 
Consequently, QoL is good when most needs are ful- 
filled and poor when few needs are satisfied. Function- 
ing is important only insofar as it permits need 
fulfilment. For example, employment has the objective 
of earning a salary, but it also leads to the fulfilment of 
a number of basic human needs (see Figure 2). Satisfac- 
tion of these needs leads to a good QoL. 

Measures of satisfaction differ from HRQL and QoL, 
as they address the process of treatment rather than its 
outcome. These measures are concerned with factors 
such as acceptability of the drug and the quality of care. 



Some PROMs, such as the EQ-5D [14] and the Health 
Utilities Index [15], can be used to generate preference 
or utility assessments. Patients' responses to these ques- 
tionnaires can be converted to estimate the value of that 
person's life on a scale ranging from death (scored 0) 
through to perfect health (rated 1). As these PROMs 
consist of items enquiring about impairments and func- 
tional limitations, they are measures of HRQL. Such 
PROMs are referred to in this article as measures of uti- 
lity, as they are widely used for this purpose in clinical 
trials. Recently, utility valuations have been derived from 
responses to disease-specific QoL instruments, providing 
more accurate measurement of this construct [16-20]. 

This article concentrates primarily on PROMs that
assess more than a single symptom (such as pain or
fatigue) or function (such as work or communication),
that do not measure satisfaction or utility and that are
used in clinical trials or for monitoring patients in clinical
practice.

[Figure 2: diagram linking the function (Employment) to its objective (Income) and to the needs fulfilled (Identity, Status, Time structure, Shared goals, Socialisation).]

Figure 2 Employment-related needs. The relationships between
function, objective and needs satisfaction are shown. Here
employment is a function undertaken to obtain income. However,
undertaking the function leads to the satisfaction of a range of
needs (some of which are listed). Quality of life (QoL) is the result of
satisfaction of the needs rather than earning an income per se.

Generic versus disease-specific PROs 

Regardless of the construct assessed, a PROM may be 
generic or disease-specific. As its name implies, a gen- 
eric instrument is intended to be used in any disease 
population. Some of the more widely known PROMs 
are generic. Examples include the Sickness Impact Pro- 
file (SIP) [21], the Nottingham Health Profile (NHP) 
[22], the Short Form 36 (SF-36) [23] and the EQ-5D 
[14]. Such instruments usually assess several domains 
and provide a profile of scores. 

Traditionally, generic instruments were used to pro- 
vide comparisons between diseases or to compare data 
with population normative values. However, the results 
of differential item functioning analyses show that such 
comparisons are scientifically flawed, as questionnaire 
items work in different ways with different patient 
groups [24-27]. Because generic measures therefore
cannot support valid comparisons between the
impacts of different diseases or between healthy and
diseased populations, they no longer have a clear role in
measuring health outcomes.
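
One common way of screening an item for differential item functioning is the Mantel-Haenszel procedure, which compares the odds of endorsing an item in two groups after respondents have been matched on their total score. The following is a minimal sketch only. The function name, the stratum labels and the toy counts are assumptions made for this illustration and are not taken from the analyses cited above [24-27], which may have used other methods.

```python
from typing import Dict, Tuple

# Each stratum is a band of matched total scores holding a 2x2 table:
# (reference_yes, reference_no, focal_yes, focal_no).
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: Dict[str, Stratum]) -> float:
    """Pooled odds ratio for endorsing one item, reference vs focal group.

    A value close to 1 suggests the item behaves alike in both groups
    once overall score is taken into account; values far from 1 flag
    possible differential item functioning.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_yes, ref_no, focal_yes, focal_no in strata.values():
        n = ref_yes + ref_no + focal_yes + focal_no
        if n == 0:
            continue  # an empty score band carries no information
        numerator += ref_yes * focal_no / n
        denominator += ref_no * focal_yes / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Invented counts for one item in two hypothetical disease groups.
toy_counts = {
    "low total score": (10, 30, 20, 20),
    "mid total score": (20, 20, 30, 10),
    "high total score": (30, 10, 36, 4),
}
or_mh = mantel_haenszel_odds_ratio(toy_counts)

# A table with identical endorsement odds in both groups, for contrast.
or_no_dif = mantel_haenszel_odds_ratio({"all": (10, 10, 10, 10)})
print(round(or_mh, 3), or_no_dif)
```

In this invented example the pooled odds ratio is one-third, meaning that at the same overall score the focal group is considerably more likely to endorse the item. An item behaving in this way would distort any comparison between the two groups.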

A second major problem with generic instruments is 
that they are not designed to capture areas of concern 
to specific patient populations. This raises two issues. 
First, they are likely to include items that are irrelevant 
for certain patient groups. For example, questions that 
address physical functioning or bodily pain will only be 
relevant if they are a feature of the disease under study. 
Asking patients to answer questions that are irrelevant 
is likely to alienate respondents and increase the poten- 
tial for missing or inaccurate responses. Second, they 
are likely to miss issues that are a specific feature of the 
disease under study. As a result, generic scales lack the 
responsiveness needed to measure change associated 
with effective treatment. 

Because these problems are now widely acknowledged,
new generic measures are no longer developed.
They have partly been replaced by item-banking
approaches whereby a subset of relevant items for a specific
condition is selected to assess patients. The most
widely used generic measures are relatively dated. The
SIP and NHP were developed in the early 1970s. The 
five items included in the EQ-5D were taken from exist- 
ing generic measures and so are of the same vintage. 
Most of the items in the SF-36 were adapted from
instruments that had been in use for 20 to 40 years
before 1992 [23]. The way in which patients conceptualise
their problems and the language with which they
express themselves can change within a generation.



Moreover, certain issues may become less important 
with time. For example, lack of mobility may be com- 
pensated for by advances in technology. Furthermore, 
the generic health status instruments have not benefited 
from improvements in test construction methodology 
and scaling techniques. Consequently, the reliability and 
responsiveness of the generic measures fall far short of 
what is required for instruments included in clinical 
trials. 

Disease-specific questionnaires are developed to 
address those aspects of outcome that are important for 
a particular patient population. In the case of needs- 
based QoL measures, this is achieved by generating the 
items by means of qualitative interviews with relevant 
patients and by thoroughly testing the validity of the 
item set with new populations of patients. More com- 
plex analyses are also employed to ensure that all items 
actually assess the construct being measured [13,28]. 
Thus, for a well-developed measure, patients will only 
be asked questions that are relevant, meaningful and 
acceptable to them. Addressing the relevant areas of
concern for the group under study maximises respondent
acceptability and minimises missing data. Consequently,
disease-specific instruments possess greater
potential for showing differences between competing 
therapies. A criticism that is often made of the use of 
disease-specific scales is the lack of comparability across 
diseases. This is a particular issue for reimbursement 
authorities, who are required to assess the comparative 
benefits of treatment reimbursement across disease 
areas. However, as noted above, the use of generic scales 
does not provide a valid basis for comparison across dis- 
eases. Recent advances in scaling theory are being 
applied to address this issue. It is now feasible to use 
disease-specific measures to make across-disease com- 
parisons, providing the instruments are based on the 
same model of the construct measured. 

Use of PROMs in medicine 

PROMs have been used in a variety of ways in clinical 
practice and research. At the level of the individual 
patient, they can be used to assess disease severity and 
response to interventions. Here the measures can be 
used to help in decision making at the physician level. 
PROs are also widely used in clinical trials to determine 
whether an intervention is effective (for example, when 
evaluating treatments for pain) and also whether 
patients feel the benefit of treatments. Evidence pro- 
vided by PROMs can thus aid decisions made by regula- 
tory bodies regarding the utility of new products. Figure 
3 shows schematically the different types of PROMs 
used in medicine. The diagram reflects the fact that 
most PROMs currently used assess HRQL rather than 
QoL or patient satisfaction. The assessment of QoL is
relatively rare, despite the term being widely used in
research reports and publications.

Figure 3 Types of PROMs currently used in medical research.
The range of different types of patient-reported outcomes (PROs) is
shown. The most commonly used PROMs assess symptoms and/or
functional limitations. These are commonly referred to as health-related
quality of life (HRQL) measures. The commonly used
measures which generate utility values also ask about symptoms
and/or functional limitations. Patient satisfaction is generally
concerned with issues such as the process of treatment and
relationships with clinical staff. QoL measures address need
fulfilment rather than symptoms and/or functional limitations.

The widespread use of HRQL measures gives some 
cause for concern. First, the term itself is misleading 
and unhelpful insofar as it implies that QoL is being 
measured. Bradley [29] argued that 'clinicians may be 
misled into thinking that findings based on a [HRQL] 
instrument indicate that treatments do not damage QoL 
when all the data reveal is that treatments do not 
damage perceived health' (page 7). Indeed, the focus on 
HRQL provides a framework for assessing interventions 
predominantly from a clinical rather than a patient per- 
spective. Second, HRQL scales do not necessarily 
address issues of primary concern to the patient. The 
focus of HRQL on the patient's ability to fulfil roles 
deemed 'normal' takes no account of the fact that 
patients with chronic disease adapt to their condition, 
often by replacing activities that they can no longer per- 
form with others that are equally satisfying. Patients 
may give up functions that become problematic and 
take up other leisure activities to maintain their QoL. 
For example, while patients with degenerative muscular
disease may experience ambulatory problems, they can
still remain independent and thus maintain a reasonable
level of QoL through the use of a walking frame or
wheelchair. HRQL measures are unable to cope with
such adaptations, making it difficult for severely ill or
disabled patients to show improvement even following
effective interventions.

QoL is the primary outcome of relevance and impor- 
tance to patients. When dealing with chronic diseases, 
the aim is frequently stated to be to improve the QoL of 
patients. This is particularly true where therapies cannot 
promise a cure or an extension to life. QoL is not 
intended to be an aid to diagnosis or a guide to the 
most appropriate intervention for a specific patient. 
However, its careful assessment should be able to deter- 
mine which alternative interventions patients as a group 
would prefer within the context of a clinical trial. 
Despite this crucial role, there are several other require- 
ments of clinical trials. For example, a product must be 
shown to improve objective health status and to be 
cost-effective. There are several chronic conditions for 
which steroid treatment improves QoL but does not 
necessarily improve the patient's health status in the 
long term. 

The needs-based model of QoL resulted from analys- 
ing the transcripts from patient interviews conducted 
during the development of the Quality of Life in 
Depression Scale [9]. Figure 2 illustrates how the func- 
tion of employment fits into the needs model. The 
objective purpose of employment is to earn money. 
However, being employed can lead to the satisfaction of 
a wide range of needs. Depressed patients who were 
unable to be employed reported problems with structur- 
ing their days, with identity and status and with reduced 
social interaction. Such needs can also be met in differ- 
ent ways, for example, by doing voluntary work or by 
joining sporting or interest clubs [30]. Research has 
shown that unemployed people who stay active in these 
ways are able to maintain their health [31]. 

The needs-based approach to QoL assessment has a 
number of advantages for measurement of the impact of 
disease and its treatment. Rather than asking directly 
about a function, it is possible to enquire about the 
needs that could be satisfied by that function. For exam- 
ple, questions about sexual performance are frequently 
left unanswered in questionnaires because of their irrele- 
vance or unacceptability. The needs approach allows 
questions to be asked about needs related to sexual 
functioning that can also be satisfied in other ways, such 
as love, intimacy, touching and sharing with another 
person. The needs-based approach also copes well with 
patient adaptation. A chronically ill person can maintain 
a reasonable level of QoL by remaining independent 
through the use of aids and/or assistance. Patients who 
have activity limitations can still be shown to have a 
good QoL, as the concern here is the degree to which 
they can meet their needs, regardless of how this is
achieved. 

Measures developed using the needs-based approach 
are disease-specific (or could be more appropriately 
described as disease-relevant). This allows them to focus 
on the specific needs interfered with by the disease and 
hence makes them highly relevant and acceptable to the 
patient. As specific needs may be affected by different 
illnesses, it is possible to develop valid methods of mak- 
ing comparisons between the impacts of different 
diseases. 

A further advantage of the needs-based measures is 
that they assess the single construct of need satisfaction, 
allowing the construction of unidimensional scales or 
indices of QoL. A major problem of HRQL measures is 
that they collect information on a range of different 
types of outcomes. Consequently, they provide a profile 
of scores (see, for example, the NHP [22], the SIP [21] 
and the SF-36 [23]). It is not possible to compare scores 
on the different sections of the profile, and it is certainly 
unacceptable simply to add together responses to the 
different sections to give a single score, although this is 
common practice in outcome measurement. 
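
The problem with adding profile sections together can be illustrated with a few lines of code: two respondents with very different patterns of problems can obtain exactly the same total. The section names and scores below are invented for this illustration and are not taken from any of the instruments cited.

```python
# Hypothetical profile scores (0 = no problems, 100 = maximum problems).
patient_a = {"pain": 80, "sleep": 10, "mobility": 10, "emotional_reactions": 0}
patient_b = {"pain": 0, "sleep": 10, "mobility": 10, "emotional_reactions": 80}


def naive_total(profile):
    """Sum section scores, treating distinct constructs as interchangeable."""
    return sum(profile.values())


def differing_sections(first, second):
    """Return the names of the sections on which two profiles disagree."""
    return sorted(name for name in first if first[name] != second[name])


total_a = naive_total(patient_a)
total_b = naive_total(patient_b)
sections = differing_sections(patient_a, patient_b)

print(total_a, total_b)  # both totals are 100
print(sections)          # ['emotional_reactions', 'pain']
```

The identical totals hide the fact that one respondent's problems are almost entirely pain and the other's almost entirely emotional, which is why a single summed score across sections cannot be interpreted.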

Selecting and using PROMs for clinical trials and 
studies 

The inclusion of poorly designed or inadequately tar- 
geted instruments in a clinical trial or study can have 
serious consequences. Furthermore, ethical questions 
are raised by asking patients to complete measures that 
are incapable of demonstrating treatment effects. It is 
strongly recommended that expert help is sought in 
selecting an appropriate PROM. Too often the choice is
based on considerations of convenience rather than of
scientific importance. PROMs may be selected because
they are commonly used, are used by a competitor or 
are available in a wide range of languages. While such 
factors can be helpful, they are minor compared with 
what the questionnaire actually measures and how well 
it does this. 

When selecting a PROM, it is first necessary to deter- 
mine the constructs that have to be assessed to meet 
the objectives of the study. Having done this, the next 
stage is to find PROMs that measure these constructs
well. It is not advisable to rely on, or to be limited to, 
the PROMs listed in databases such as OLGA [32] or 
the Patient-Reported Outcome and Quality of Life 
Instruments Database (PROQOLID) [33]. Such sources 
of information are often selective and/or omit important 
measures. Furthermore, they rely on test authors to pro- 
vide information on the quality of the measures listed 
without providing any commentary on the acceptability 
of testing methods used or the appropriateness of the 
conclusions drawn. A thorough search of the medical
literature should be made to find available measures and
evaluate their suitability for use in the trial. This will 
often generate a host of potential PROMs that will vary 
considerably in terms of the care with which they were 
developed and their psychometric quality. 

Selecting the most appropriate questionnaire requires 
consideration of several key quality standards. These 
cover the development processes, instrument scaling, 
psychometric properties and cultural translation and 
adaptation processes (Figure 4) [6]. These standards are 
described in detail in the Appendix. 

It is increasingly common for trials to include PROMs. 
In some cases, they have been accepted as primary end 
points by the health authorities. However, these PROMs 
actually measure clinical end points (such as pain) that 
cannot be determined objectively. Indeed, the US Food
and Drug Administration (FDA) prefers these types of
PROs to be employed, as it appears to be uncomfortable
with more subjective outcomes [34,35]. This contrasts
with the European Medicines Agency (EMA),
which welcomes QoL outcomes that describe the added 
benefits of new products [36]. Both the EMA and the 
FDA emphasise that outcome measures selected for a 
study should be well targeted to the specific patient 
population, which fundamentally rules out the use of 
generic PROMs. It is notable that both bodies now
consider the most widely used PROM, the SF-36, to be
unsuitable for making claims about the value of treatments.
Indeed, this measure has such poor psychometric
properties that it has never proved to be a valuable
instrument for showing differences between active treatments
(see, for example, [37] and [38]). It has
been shown that sample sizes of up to 20,000 per study
arm would be required for SF-36 domains to
show such differences [39].
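
To see how an unresponsive scale can drive sample sizes to this order of magnitude, consider the standard normal-approximation formula for comparing two group means, n per arm = 2(z_{1-alpha/2} + z_{1-beta})^2 / d^2, where d is the standardised difference the instrument can detect. The sketch below is an illustration under assumed values (two-sided alpha of 0.05, 80% power, hypothetical effect sizes). It is not the calculation reported in [39].

```python
import math
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided two-sample comparison.

    effect_size is the standardised mean difference (difference between
    arms divided by the common standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical standardised differences between two active treatments.
n_moderate = n_per_arm(0.5)   # 63 per arm
n_small = n_per_arm(0.2)      # 393 per arm
n_tiny = n_per_arm(0.03)      # more than 17,000 per arm

print(n_moderate, n_small, n_tiny)
```

Because the required sample size grows with the inverse square of the detectable difference, a measure that registers only a few hundredths of a standard deviation between active treatments needs tens of thousands of patients per arm.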

Where the instrument is used as a clinical end point 
in a trial and/or is intended to be used to support a pro- 
duct label claim or to provide information for inclusion 
in the Summary of Product Characteristics, it is neces- 
sary to agree in advance with the appropriate authorities 
that data collected with the measure will be acceptable 
to them. This generally involves providing a detailed 
briefing book. The briefing book must include informa- 
tion on how each item was generated and the reasons 
for rejecting items. Evidence is required for the whole 
testing procedure and the development and validation of 
all language versions of the measure to be used in the 
trial. Problems will occur with older measures, where 
such information is unlikely to be available and/or the 
development methodology was inadequate. Where a 
new measure is being developed for a specific trial, it is 
prudent to keep the authority informed at each stage of 
instrument development. The EMA now has a biomarker
qualification system in operation that allows PRO
instruments to be evaluated [40]. Once qualification has
been achieved, the EMA will accept all data collected in
a trial that uses the measure. The FDA has also issued a
draft guidance document covering PRO instrument
qualification [41].

Key questions and judgment criteria (each criterion is answered Yes or No)

Does it measure the required construct?
• The instrument is based on a model or theory of the underlying construct.

Has it been well-developed?
• Items were derived from the appropriate source/population.
   • Clinical input required for symptoms/functioning
   • Patient input required for all constructs
   • Patient only valid source of items for QoL measures
• Content is clear and unambiguous.
   • Items are not double-barreled or ambiguous
• Instrument is practical to complete.
   • Mode of administration is suitable
   • Response options are clear

Have the scaling properties been adequately assessed?
• The scale (or individual subscales) is unidimensional, thus justifying production of single scale (or subscale) score.
   • Factor analysis alone does not provide adequate evidence of unidimensionality.
• Dimensional structure of multi-dimensional scales has been demonstrated.
• Evidence of level of measurement (ordinal, interval) is available to justify use of appropriate statistics for analysis of scale data.

Have the psychometric properties been adequately assessed?
• Reliability (reproducibility) assessed over appropriate time frame (2 weeks ideal). Minimum value of 0.85 achieved for test-retest correlation.
• Internal consistency (Cronbach's alpha).
   • This does not confer reliability.
• Evidence of relevance and content validity.
   • Face and content validity assessed with relevant patient group
• Evidence of construct validity is available.
   • Scale association with related constructs assessed
   • Known groups validity assessed
• Evidence of responsiveness?

Have additional language versions been appropriately adapted?
• Evidence of appropriate translation methods. At a minimum:
   • Methods are well documented and transparent
   • Translation has not relied solely on bilinguals but has involved people of average educational level within target culture
• Evidence of psychometric and scaling properties. At a minimum:
   • Face and content validity established for target culture
   • Evidence of reliability and construct validity
   • Unidimensionality of scales (or subscales) has been confirmed

Figure 4 Brief checklist for assessing the quality of PRO instruments. The specific requirements of a good-quality PROM are shown. These
qualities should be clearly reported in peer-reviewed publications. In many cases (including that of the most commonly employed PROMs), this
information is not available. New instrument development methodologies, in particular the establishment of the scaling properties of a measure
(item response theory), are essential to ensuring the quality of PROMs.
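
Two of the psychometric criteria listed in Figure 4, test-retest reliability and internal consistency, can be computed with a few lines of code. The sketch below uses invented data and hypothetical function names. The 0.85 threshold is the minimum test-retest correlation given in Figure 4. As the checklist notes, a high Cronbach's alpha does not in itself confer reliability.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented total scores for five patients tested twice, 2 weeks apart.
first_administration = [10, 12, 15, 20, 8]
second_administration = [11, 12, 14, 19, 9]
retest_r = pearson_r(first_administration, second_administration)
meets_threshold = retest_r >= 0.85

# Invented responses of five patients to three yes/no items.
item_responses = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 0, 0]]
alpha = cronbach_alpha(item_responses)

print(round(retest_r, 3), meets_threshold, round(alpha, 2))
```

With these toy data the test-retest correlation exceeds the 0.85 minimum, while alpha is 0.75. The two statistics answer different questions, so a report of alpha alone leaves the reproducibility criterion unaddressed.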

Sufficient time should be allowed to ensure that the 
required language versions of a measure have been 
developed and validated (see below). Very often
poor-quality translations are produced, relying on simple
forward-backward translation techniques rather than using
an approach that involves relevant patients. Adapting
measures appropriately is a time-consuming procedure 
that needs to be built into trial planning. 

Once an instrument has been selected, it is crucial 
that its value and the reasons for its use are clear to 
everyone involved in the trial. If this is not done, data 
collection with the measure will be of debatable value. 
Staff involved in the trial at each centre will require 
training on the application of the measure and how to 
deal with problems that might arise. 

Development and validation of PRO measures 

Where a search fails to identify a high-quality PROM for
a trial, a new questionnaire will have to be developed.
Planning for such an event is important,
as the development process can be time-consuming,
particularly if several language versions of the measure
are required. I would argue strongly that the content for
such a measure should be generated by means of one- 
to-one patient interviews, as the content should be rele- 
vant and acceptable to future patients. 

There are four key stages in instrument development: 
(1) identification of the measurement model, (2) genera- 
tion of questionnaire content, (3) content refinement 
and item reduction and (4) scaling and psychometric 
evaluation. These stages are summarised in Table 2 
[42-44]. 

Adapting PRO measures 

If the required language versions are known from the
outset, instrument development should be conducted in
parallel in the relevant countries [45,46]. However, this
information is rarely available, and it is more common for
subsequent language adaptations to be required. Again, 
it is necessary to allow sufficient time for such adapta- 
tions to be produced, as the process can be time- 
consuming. 

Translation procedures 

Translating PROMs is a complex task that cannot be
undertaken lightly without the risk of producing
poor-quality adaptations. It is commonly stated that
forward-backward translation is the gold standard in translation
methodology [47]. However, there is no evidence to
support this view; it is merely a statement of belief. When
such translation work was first handed to translators,
test developers felt the need to assess the quality of the
new version by some sort of 'scientific' method. This led
to the introduction of forward-backward translation.
However, such a methodology raises the hackles of
translators, and not only because it casts inappropriate
doubts on their abilities. If the translation is good, then
the back-translation may well look nothing like the
source questionnaire. Consequently, little information of
value is obtained by conducting the backward
translation, while misleading impressions can result. Instead,
quality should be built into every stage of the translation
procedure rather than checking it a posteriori.

Table 2 Development and validation of QoL measures

There are four key stages in instrument development:

► Identification of measurement model: QoL scales should be based on a
stated model or theory of QoL.

► Generation of questionnaire content: Content of all QoL scales should
be derived from interviews with relevant patients. Both the concerns
and the wording used in the items should be generated during these
interviews. Thirty to thirty-five interviews are usually sufficient to
generate items. Qualitative analysis of the transcripts allows the
construction of a QoL outcome model for the disease.

► Content refinement and item reduction: Content validity is assessed by
comparing the issues covered by the items to the outcome model.
Retained items should be clearly expressed, address only one issue,
avoid duplication, be potentially capable of change and apply to all
respondents. The draft measure should then be tested with a new set
of patients to check comprehension, ability to answer the measure and
ensure item relevance.

► Scaling and psychometric evaluation: Formal testing of dimensionality,
reproducibility and construct validity should be achieved by means of a
test-retest survey. In most European countries and North America, the
survey can be conducted by post. A sample of 100 or more is
preferable. It is strongly recommended that this stage should employ
Item Response Theory techniques [42,43].

QoL, quality of life.

Rather than relying on forward-backward translation, a 
dual-panel methodology has been developed and is now 
commonly employed (see Table 3). A recent study has 
shown that the 'dual panel' methodology produces 
translations that are more acceptable to patients in the 
new country than the use of forward-backward transla- 
tion [48]. 

It is important to remember that this is only the start 
of the adaptation process. The new translation should 
then be tested by means of face-to-face interviews with 
several relevant patients to ensure that the adapted ver- 
sion has face and content validity (known as 'cognitive 
debriefing'). Finally, the psychometric properties of the 
adapted questionnaire must be established with new 
patient samples. This requires a test-retest survey to be 
conducted for each new language version produced. 



Table 3 Recommendations for the production of high- 
quality adaptations 

The dual panel method is recommended for producing high-quality 
translations. The following recommendations are made: 

Recruit 'translators' who currently live in the target country and 
whose command of English is good. 

The meeting should be held in the country for which the 
measure is required. 

Five to seven people enable fruitful discussion. 

It is preferable to exclude professional translators. 

An instrument developer should attend this meeting to explain the 
intent of the items and their specific meanings in the context of 
the questionnaire. 

Inform the group of the model underlying the questionnaire, how 
it was developed, its design and its content and target audience. 

Inform the group of the translation requirements (in particular 
accessibility and acceptability of wording). 

The group should work as a team with a coordinator whose task
is to check that none of the parameters are neglected (in 
particular, structural and metric aspects that could be overlooked). 

Allow adequate time for the meeting to explore all issues fully. 

Once the translated version of the instrument is agreed, have it 
assessed by a lay panel, again working as a group: 

The coordinator involved in the first panel should work with 
this panel also to ensure that the original meaning of the 
items and the questionnaire structure are maintained. 

The results of this meeting should be used to make final 
decisions about the wording of the questionnaire. 

The whole procedure should be reported in detail, in 
particular explaining translation choices and changes made 
following lay panel testing. This not only provides information 
on the process undertaken but also constitutes a thorough 
final review. 



Such retesting is rarely undertaken but is necessary to 
show that the new language version works in the same 
way as the original, evidence that is required by the 
FDA [35]. 

Conclusions 

The development, administration, analysis and adapta- 
tion of PROMs must be carried out by highly skilled 
specialists. Too often nonspecialists are given the tasks 
of determining which outcomes should be included in 
clinical studies and trials and how these should be measured.
Unfortunately, this largely explains why few such
studies provide useful data. Such a situation represents a
waste of resources and a missed opportunity to show the
benefits of expensive new products. A more professional
approach to assessing PROs is needed. Of particular
concern is the paucity of QoL studies undertaken, given 
that high-quality measures specific to several diseases 
are available [49]. 

Selecting the best PROMs for a trial should be given 
the same consideration as choosing clinical outcome 
measures. Too often PROMs are selected at too late a 
stage to allow required language adaptations to be 
produced. Consequently, less suitable measures are often
selected. It is common for very expensive clinical trials 
to waste the opportunity to assess QoL or other PROs 
appropriately because of lack of planning or unwilling- 
ness to pay for the necessary development work. In rea- 
lity, the cost of such work is minimal in comparison to 
the overall cost of the trial. 

The development of PROMs is far from a common- 
sense procedure. Success is dependent on both expertise 
and experience. Table 4 lists some of the issues covered 
in this article. Many if not most of the points listed are 
counter to the commonsense view on outcome mea- 
surement and instrument development. 

Given the expressed desire of organizations such as 
the FDA and the EMA to be made aware of the benefits 
of treatment from the consumer's perspective and the
need to convince payers of the added benefit of new 
treatments, it is to be hoped that more attention will be 
paid in future to the assessment of the effects of new 
interventions from the patient's perspective. 

Table 4 A new common sense for patient-reported
outcome assessment

Do not rely on instrument databases for PRO identification and
selection.

HRQL consists of symptoms, functions and limited aspects of the
impact of these.

HRQL is very different from QoL.

The needs-based model of QoL is the most widely employed in
medical research.

True QoL has rarely been measured in clinical studies and trials.

The content of QoL measures must be derived from relevant patients.

PROMs must be simple to administer, complete and score.

Simple two-point response formats are preferable to multiple response
formats [43].

All PROMs used in clinical trials should be disease-specific.

Generic PROMs do not allow the impact of different diseases on
patients to be compared.

Population norms for PROMs are invalid.

Think twice before selecting generic measures such as the EQ-5D to
determine utility estimates, as they have limited psychometric quality.

QoL is a unidimensional construct.

Data collected using PROMs must be shown to be unidimensional.

Scores on subscales can rarely be added together to give a total score.

High reliability (reproducibility) is crucial to the accuracy of PROMs.

Forward-backward translation is a flawed methodology, creating
unnecessary work.

Think carefully before using PROMs developed in the Western world in
Asia and Africa.

Evidence is required of the scalability, reproducibility and construct
validity of all language versions of PROMs used in a clinical trial.

PRO, patient-reported outcome; PROM, patient-reported outcome measure;
HRQL, health-related quality of life; QoL, quality of life.

The science of PROMs is developing quickly. For too
long, outdated generic HRQL measures such as the SF-36,
NHP and EQ-5D have been relied on in clinical
studies. It is now well understood that such measures are
inadequate for showing change over time or the different
impacts of alternative interventions. Greater emphasis is
now placed on measurement models, disease-specific
measurement and the application of Item Response Theory
rather than Classical Test Theory. Well-developed
measures are now generally of better quality and are
more sensitive than many clinical outcome measures.

The development and use of PROMs have suffered 
from a lack of theory and poor basic development work 
for far too long. We have been willing to continue using 
the same poor generic PROMs because we are familiar 
with them, despite their age, lack of quality and inability 
to do the job for which they are intended. Given the 
cost of clinical trials and the importance of evaluating 
health services from the perspective of the patient, it is 
essential that the quality of PROMs improves. It is also 
time to reject the view that the only valid PROs are 
symptom scores and limited functional assessments 
mimicking clinical outcomes. It is important for PRO 
practitioners to argue strongly on behalf of the patient 
that we should also measure carefully those outcomes 
that really matter to them. 
Appendix

Development and validation of patient-reported outcome 
measures 

There are four key stages in instrument development: 
(1) identification of the measurement model, (2) genera- 
tion of questionnaire content, (3) content refinement 
and item reduction and (4) scaling and psychometric 
evaluation. 

Identification of the measurement model 

The requirement for a measurement model appears to 
be common sense: How else is it possible to decide 
which items to include in the measure? However, it is 
astounding how infrequently test developers report the 
measurement model that guided the development of 
their measurement instrument. Measures of symptoms
or functioning may well be based on the World Health
Organisation's International Classification of Functioning,
Disability and Health classification of impairments
and activity limitations, respectively [1,2]. A measure of
quality of life (QoL) is likely to be based on the
needs-based model of QoL [9]. The model employed should be
reported in the instrument development publication to
allow readers the opportunity to consider whether the
measure employed is reasonable and practical.

Generation of questionnaire content 

The development of patient-reported outcome (PRO) 
instruments is a highly skilled activity best undertaken 
by specialists in measurement and psychometrics. It is 
particularly important that the content of these instru- 
ments is generated by researchers experienced in quali- 
tative interviewing techniques. Content for all PRO 
measures (PROMs) should be derived from interviews 
with relevant patients (for QoL scales) or experts and/or 
patients (for measures of health-related quality of life). 
Thus, if a measure of QoL specific to endometriosis is 
required, the content will be derived from qualitative 
interviews conducted with women experiencing the pro- 
blem. Such interviews are not intended to explore issues 
identified in the literature or by clinical experts. Both
the relevant concerns and the wording used in the
questionnaire items must be generated during these
interviews. This is the most crucial stage of instrument
development and must be carried out by skilled specia- 
lists. If good-quality questionnaire items are not identi- 
fied, the resulting instrument will be poor. 

The interviews, which may last several hours, should 
be audio-recorded and transcriptions produced. It is 
generally found that 30 to 35 interviews are sufficient to 
generate items for a scale. Additional interviews tend 
not to identify new issues of importance. Interviewees 
will generally raise specific functions that are proble- 
matic for them. The skill of the interviewer is to probe 
such responses carefully to understand how the patient's 
life is impaired by such restricted functioning. The 
needs-based model of QoL grew out of such probing in 
the development of the Quality of Life in Depression 
Scale [9]. Depressed patients who were unable to work 
reported problems with structuring their days, with 
identity and status and with reduced social interaction 
(see Figure 1). 

Qualitative analysis of the transcripts allows the con- 
struction of a PRO model for the disease. This model 
will identify the issues and/or needs that are relevant for 
assessment of patients with the disease studied. The 
analysis will also identify potential items for inclusion in 
the measure. Where possible, it is preferable to keep the 
wording used by interviewees for the items, although 
minor changes may be necessary. Items are then best
expressed as statements made by patients, such as 'I've
lost interest in food' or 'I feel dependent on other people'.
Stating the items in this form leads to a response
format of 'yes' or 'no' or 'true' or 'not true'. This is a
natural way of responding to items that should enquire
into issues that are clear-cut. The application of modern
psychometric models (such as Rasch analysis) indicates
that increasing the number of possible responses for an
item does not increase the sensitivity of the scale.
Instead, the final set of items should each represent a
different amount of the construct measured in the same
way that the marks on a ruler denote different lengths.
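
For readers unfamiliar with Rasch analysis, the ruler analogy can be made concrete with the standard dichotomous Rasch model. The notation here is introduced only for illustration and does not appear in the original text: θ_n is assumed to denote the location of patient n on the construct, and b_i the location of item i on the same scale (its 'mark on the ruler').

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

Under this model, the probability that a patient affirms an item ('yes' or 'true', coded 1) depends only on the difference between the patient's location and the item's location. When θ_n equals b_i, the probability is 0.5. A well-targeted scale therefore has items whose locations are spread along the construct rather than clustered at one point.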

Content refinement and item reduction 

Patient interviews will identify a large set of potential 
items. Content validity is assessed by comparing the 
issues covered by the items to the outcome model and 
other sources of information about the impact of the 
disease. The first stage of item reduction involves ensuring
that items are clearly expressed. For example, they
should address only one issue, avoid duplication, be
potentially capable of change with effective treatment
(for example, avoiding statements such as 'I worry that
my illness will become worse') and apply to all respondents.
Items that are not relevant to every respondent are poor, as
they lead to ambiguous responses.

Scaling and psychometric evaluation 

The next stage is to test the draft questionnaire with a
new set of relevant patients by means of cognitive
debriefing interviews. These interviews will explore
interviewees' ability to understand and complete the measure
and ensure that items are considered relevant. In this
way, the face validity of the measure will be established.
Changes in wording can still be made at this stage, and 
items can be removed or added as a result of the 
interviews. 

Formal testing of the questionnaire for dimensionality, 
reproducibility and construct validity is then undertaken 
by means of a test-retest survey. In most European and 
North American countries, the survey can be conducted 
by post. While test-retest reliability (reproducibility) can 
be assessed with a sample of around 50, the need to 
determine the dimensionality of the scale means that a 
sample of 100 or more is preferable. 
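
As a rough illustration of how reproducibility might be estimated from a test-retest survey, the following Python sketch scores a two-point ('true'/'not true') questionnaire at two administrations and computes a rank correlation between the total scores. The choice of Spearman's coefficient, the function names and the data layout are all assumptions made for this example; the article does not prescribe a particular statistic, and this is a sketch rather than a complete psychometric analysis.

```python
from math import sqrt


def total_score(responses):
    """Total score for one patient: number of items affirmed (1 = 'true', 0 = 'not true')."""
    return sum(responses)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def retest_reliability(time1, time2):
    """Spearman rank correlation between total scores at the two administrations.

    time1 and time2 are lists of per-patient response lists, in the same patient order.
    """
    scores1 = [total_score(r) for r in time1]
    scores2 = [total_score(r) for r in time2]
    return pearson(average_ranks(scores1), average_ranks(scores2))
```

In use, `retest_reliability(first_administration, second_administration)` would be called with the responses of the same patients on both occasions. A coefficient close to 1 suggests that patients are ordered similarly each time. A sample of around 50, as mentioned above, concerns only this reproducibility estimate; assessing dimensionality requires separate analysis and a larger sample.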